
feat!: multilingual text-to-speech#1134

Merged
IgorSwat merged 31 commits into
mainfrom
@is/multilingual-tts
May 20, 2026


@IgorSwat
Contributor

@IgorSwat IgorSwat commented May 8, 2026

Description

Introduces major changes to the text-to-speech module based on the Kokoro model, including:

  • Multilingual text-to-speech - a set of complete pipelines & voices for different languages. A complete list of currently supported languages can be found below.
  • Improved phonemization & speech quality - using a neural phonemization model as a fallback for the old lexicon-based phonemization significantly improves speech quality, particularly for non-standard, out-of-dictionary words.
  • Timestamp-based audio cutting - an improved postprocessing algorithm eliminates artifacts introduced by the .pte model, resulting in cleaner, more natural speech.
  • API changes - the API is now prepared for voice cloning & custom, fine-tuned versions of the Kokoro model.

Current status of supported languages:

  • 🇺🇸 American English: ✅
  • 🇬🇧 British English: ✅
  • 🇫🇷 French: ✅
  • 🇪🇸 Spanish: ✅
  • 🇵🇹/🇧🇷 Portuguese: ✅
  • 🇮🇹 Italian: ✅
  • 🇵🇱 Polish: ✅
  • 🇩🇪 German: ✅
  • 🇮🇳 Hindi: ✅
  • 🇯🇵 Japanese: ❌ (coming soon)
  • 🇨🇳 Mandarin Chinese: ❌ (coming soon)

Introduces a breaking change?

  • Yes
  • No

There are 2 major breaking changes introduced by this PR:

  • Changed the "synthesis from phonemes" API.

    Old API:

     const audioData = await tts.forwardFromPhonemes({
       phonemes:
         'ɐ mˈæn hˌu dˈʌzᵊnt tɹˈʌst hɪmsˈɛlf, kæn nˈɛvəɹ ɹˈiᵊli tɹˈʌst ˈɛniwˌʌn ˈɛls.',
     });
    

    New API:

    const audioData = await tts.forward({
      text:
        'ɐ mˈæn hˌu dˈʌzᵊnt tɹˈʌst hɪmsˈɛlf, kæn nˈɛvəɹ ɹˈiᵊli tɹˈʌst ˈɛniwˌʌn ˈɛls.',
      phonemize: false, // Disables phonemization and treats text as phonemes
    });
    
  • Changed the predefined model and voice setups. Model files and voice/phonemization files are now bundled together, because languages such as Polish or German have fine-tuned model weights.

    Old API:

    const model = useTextToSpeech({
      model: KOKORO_MEDIUM,
      voice: KOKORO_VOICE_AF_HEART,
    });
    

    New API:

    const model = useTextToSpeech(KOKORO_AMERICAN_ENGLISH_FEMALE_HEART);
    

Type of change

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Documentation update (improves or adds clarity to existing documentation)
  • Other (chores, tests, code style improvements etc.)

Tested on

  • iOS
  • Android

Testing instructions

Play around with the demo speech apps.

Unit tests for RNE-specific code will be added later on.
The Phonemis package has its own wide range of unit tests (see the Phonemis repo).

Screenshots

Related issues

#712

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

@IgorSwat IgorSwat requested review from chmjkb and msluszniak May 8, 2026 14:24
@IgorSwat IgorSwat force-pushed the @is/multilingual-tts branch from 8380a2a to eb999a7 Compare May 8, 2026 14:26
@IgorSwat IgorSwat self-assigned this May 8, 2026
@IgorSwat IgorSwat added the feature (PRs that implement a new feature) and improvement (PRs or issues focused on improvements in the current codebase) labels May 8, 2026
@IgorSwat IgorSwat changed the title from "feat: multilingual text-to-speech" to "feat!: multilingual text-to-speech" May 8, 2026
Member

@msluszniak msluszniak left a comment


You should also update the code samples in the documentation, and the documentation in general. Also address the lint warnings; there are plenty of words that you need to add to the cspell ignore list.

Comment thread packages/react-native-executorch/react-native-executorch.podspec Outdated
@msluszniak
Member

Also, if this PR adds a breaking change, please describe it directly below the "Introduces a breaking change?" section in the PR body.

Comment thread apps/speech/components/ModelPicker.tsx
Comment thread apps/speech/screens/TextToSpeechScreen.tsx
Comment thread apps/speech/screens/TextToSpeechScreen.tsx Outdated
Comment thread packages/react-native-executorch/android/src/main/cpp/CMakeLists.txt Outdated
Comment thread packages/react-native-executorch/src/constants/tts/voices.ts
@msluszniak msluszniak linked an issue May 18, 2026 that may be closed by this pull request
5 tasks
msluszniak and others added 3 commits May 19, 2026 11:31
…e type aliases

TypeDoc emits `export type` declarations under `06-api-reference/type-aliases/`,
not `06-api-reference/interfaces/`. The links in useTextToSpeech.md pointed at
the interfaces/ paths, which never get generated for these names, breaking the
Docusaurus build (`onBrokenLinks: 'throw'`).
@IgorSwat IgorSwat force-pushed the @is/multilingual-tts branch from 10e8e1c to 38340f6 Compare May 19, 2026 11:32
- tests/CMakeLists.txt: build phonemis from source (add_subdirectory)
  and propagate its include dir to rntests_core. The previous IMPORTED
  STATIC pointed at a libphonemis.a that nothing builds.
- FrameTransformTest, ObjectDetectionTest, InstanceSegmentationTest:
  update bbox member access for #1130's BBox refactor
  (.x1/.y1/.x2/.y2 → .p1.x/.p1.y/.p2.x/.p2.y).
- PoseEstimationTest: keypoint type became float in #1130; update the
  static_assert from int32_t to float.
- FrameTransformTest: make the three Right_* tests platform-aware.
  Production inverseRotateBbox/inverseRotatePoints are a no-op on
  Android for Right (front-cam upright portrait); rotateFrameForModel
  rotates CW on Android vs CCW on iOS. Tests now have #if defined(__APPLE__)
  branches matching production.
- SpeechToTextTest: GTEST_SKIP TranscribeReturnsValidChars with a TODO —
  known-failing on this branch, needs separate investigation.
- run_tests.sh: fix two stale Hugging Face URLs (fsmn-vad and
  yolo26n-pose filenames had changed upstream, causing wget to 404 and
  silently abort the script).
Member

@msluszniak msluszniak left a comment


Please make sure that iOS is also tested, since I don't have an iOS device for testing.

@barhanc barhanc self-requested a review May 19, 2026 15:18
Contributor

@barhanc barhanc left a comment


I've tested the Speech app on iOS and standard usage is fine. There are, however, some lurking bugs: in the Text to Speech app, trying to generate speech using 'AF Heart' from text written in the Hindi script (e.g. 'शुभ प्रभात') crashes the whole app with `offset + count out of range` at `audio = audio.subspan(0, lastTokenTimestamp);` in `rnexecutorch::models::text_to_speech::kokoro::Kokoro::synthesize`.

@msluszniak
Member

> I've tested the Speech app on iOS and standard usage is fine. There are, however, some lurking bugs: in the Text to Speech app, trying to generate speech using 'AF Heart' from text written in the Hindi script (e.g. 'शुभ प्रभात') crashes the whole app with `offset + count out of range` at `audio = audio.subspan(0, lastTokenTimestamp);` in `rnexecutorch::models::text_to_speech::kokoro::Kokoro::synthesize`.

Uuu, good catch. Have you tried other language-specific characters, like ü?

@barhanc
Contributor

barhanc commented May 19, 2026

> > I've tested the Speech app on iOS and standard usage is fine. There are, however, some lurking bugs: in the Text to Speech app, trying to generate speech using 'AF Heart' from text written in the Hindi script (e.g. 'शुभ प्रभात') crashes the whole app with `offset + count out of range` at `audio = audio.subspan(0, lastTokenTimestamp);` in `rnexecutorch::models::text_to_speech::kokoro::Kokoro::synthesize`.
>
> Uuu, good catch. Have you tried other language-specific characters, like ü?

Yeah, I tried the same with German-, French-, and Spanish-specific characters and there wasn't any problem.

@IgorSwat
Contributor Author

> I've tested the Speech app on iOS and standard usage is fine. There are, however, some lurking bugs: in the Text to Speech app, trying to generate speech using 'AF Heart' from text written in the Hindi script (e.g. 'शुभ प्रभात') crashes the whole app with `offset + count out of range` at `audio = audio.subspan(0, lastTokenTimestamp);` in `rnexecutorch::models::text_to_speech::kokoro::Kokoro::synthesize`.

Fixed.

@msluszniak
Member

@IgorSwat inspired by Bartek's finding, I'm trying some other attacks to expose more problems. I will come back with my findings.

@msluszniak
Member

msluszniak commented May 19, 2026

TTS edge-case findings from stress testing

Ran a battery of inputs against Kokoro::generate (via forward()) and the streaming path on Android. Bartek's exact crash is no longer reproducible — e173e9d94 fixes it. The following are still open.

1. speed parameter has no validation

| speed | result |
| --- | --- |
| 0 | throws bare std::exception (no message, no error code) |
| NaN / Infinity / -1 / 1e9 | silently accepted; emits ~5500 samples regardless of text |
| 1e-6 | emits 150 945 samples (≈6.3 s) for "Hello world" |

1e-6 is the most worrying — audioLength = kTicksPerDuration * effectiveDuration is int32_t (Kokoro.cpp:349); a small enough speed overflows that and the synthesizer allocates unbounded memory. Suggested guard: reject non-finite or ≤ 0 speeds at the JS boundary and in Kokoro::generate, with a real RnExecutorchError(InvalidUserInput, …).
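
The JS-boundary half of the suggested guard could look roughly like the TypeScript sketch below. The function name `assertValidSpeed` and the clamping bounds are assumptions made for illustration and are not part of the library; the suggestion above additionally calls for a matching check in Kokoro::generate.

```typescript
// Hypothetical helper, not part of react-native-executorch.
// The bounds are assumed values chosen only for illustration.
const MIN_SPEED = 0.25;
const MAX_SPEED = 4.0;

function assertValidSpeed(speed: number): number {
  // Number.isFinite rejects NaN, Infinity and -Infinity in one check.
  if (!Number.isFinite(speed) || speed <= 0) {
    throw new RangeError(
      `Invalid speed ${speed}: expected a finite number greater than 0`
    );
  }
  // Clamp to the assumed range so tiny values such as 1e-6 cannot
  // inflate the requested audio length.
  return Math.min(MAX_SPEED, Math.max(MIN_SPEED, speed));
}
```

Clamping is one possible policy; rejecting out-of-range values outright would be equally valid and arguably less surprising for callers.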

2. Streaming worker hangs on non-EOS content

Kokoro.cpp:171-189:

size_t chunkSize = (eosIt != inputTextBuffer_.rend())
                       ? std::distance(eosIt, inputTextBuffer_.rend())
                       : 0;

if (chunkSize > 0 ||
    streamSkippedIterations >= params::kStreamMaxSkippedIterations) {
  input = inputTextBuffer_.substr(0, chunkSize);   // chunkSize still 0
  inputTextBuffer_.erase(0, chunkSize);            // erases nothing
  streamSkippedIterations = 0;                     // reset, loop forever
}

When streamInsert content has no end-of-sentence character, the buffer never drains. The skip-threshold force-flush path fires but uses chunkSize=0 to extract the chunk, so it produces an empty input and resets the counter. streamStop(false) then waits forever for the buffer to empty. streamStop(true) is the only recovery.

Repros: streamInsert('a'), streamInsert('hello world'), 2000× U+200D — all permanently hang the worker.

Suggested fix:

if (chunkSize > 0) {
  // normal flush by EOS
} else if (streamSkippedIterations >= params::kStreamMaxSkippedIterations) {
  input = inputTextBuffer_.substr(0, searchLimit);
  inputTextBuffer_.erase(0, searchLimit);
  streamSkippedIterations = 0;
} else {
  streamSkippedIterations++;
}
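
To make the control flow of this suggestion easier to experiment with, here is a simplified TypeScript model of the chunk selection. It is a sketch, not the real Kokoro.cpp logic: the EOS character set, the threshold value, and the choice to drain the whole buffer on a forced flush (instead of `searchLimit` characters, which is not defined in the excerpt) are all assumptions.

```typescript
// Simplified model of the streaming worker's chunk selection.
const EOS_CHARS = '.!?'; // assumed end-of-sentence characters
const MAX_SKIPPED_ITERATIONS = 3; // stands in for kStreamMaxSkippedIterations

interface StreamState {
  buffer: string;
  skippedIterations: number;
}

// Returns the text to synthesize on this iteration;
// an empty string means "wait for more input".
function takeChunk(state: StreamState): string {
  // Index of the last EOS character in the buffer, or -1 when there is none.
  let lastEos = -1;
  for (let i = 0; i < EOS_CHARS.length; i++) {
    lastEos = Math.max(lastEos, state.buffer.lastIndexOf(EOS_CHARS.charAt(i)));
  }
  const chunkSize = lastEos + 1;

  if (chunkSize > 0) {
    // Normal flush: everything up to and including the last EOS character.
    const chunk = state.buffer.slice(0, chunkSize);
    state.buffer = state.buffer.slice(chunkSize);
    state.skippedIterations = 0;
    return chunk;
  }
  if (
    state.buffer.length > 0 &&
    state.skippedIterations >= MAX_SKIPPED_ITERATIONS
  ) {
    // Forced flush: drain the buffer rather than extracting 0 characters.
    const chunk = state.buffer;
    state.buffer = '';
    state.skippedIterations = 0;
    return chunk;
  }
  state.skippedIterations++;
  return '';
}
```

With this shape, a buffer holding only 'hello world' is emitted once the skip threshold is reached, so a graceful stop can observe an empty buffer. How large the threshold should be for slow producers is left open here.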

3. streamStop(true) drops in-flight audio silently

Kokoro.cpp:137-145:

auto nativeCallback = [this, callback](const std::vector<float> &audioVec) {
  if (this->isStreaming_) {        // false after streamStop(true)
    this->callInvoker_->invokeAsync(...);
  }
};

If streamStop(true) lands while a chunk is mid-synthesis, the synthesizer finishes the chunk and then the callback no-ops — the audio is generated and discarded with no signal. In a captioning / live-narration context that's a silently lost sentence.

Suggested fix: deliver the chunk that completed before the stop, or surface "aborted with in-flight chunk discarded" through onEnd/onCancel.
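
The second option (surfacing the discard) might be sketched in TypeScript as follows. Everything here is hypothetical: `createChunkGate`, `StreamEndInfo`, and the shape of the end callback are illustrative names, not existing library API.

```typescript
// Hypothetical sketch: count chunks that finish after a stop request
// and report them once the worker exits.
interface StreamEndInfo {
  aborted: boolean;
  discardedChunks: number;
}

function createChunkGate(
  onChunk: (audio: Float32Array) => void,
  onEnd: (info: StreamEndInfo) => void
) {
  let streaming = true;
  let discardedChunks = 0;

  return {
    // Called by the synthesizer whenever a chunk is ready.
    deliver(audio: Float32Array): void {
      if (streaming) {
        onChunk(audio);
      } else {
        // The chunk was synthesized but the stream was already stopped.
        discardedChunks++;
      }
    },
    // Called when the user requests an immediate stop.
    stop(): void {
      streaming = false;
    },
    // Called once when the worker exits; reports what was dropped.
    finish(): void {
      onEnd({ aborted: !streaming, discardedChunks });
    },
  };
}
```

A caller that cares about lost sentences (for example, live narration) can then inspect `discardedChunks` in the end callback, while callers that do not care can ignore it.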

4. Observational (optional): solo punctuation produces ~0.5 s of audible artifacts

Single-character inputs like ., !, ?, ... produce 12k–17k samples of mostly-silent audio that contains low-amplitude artifacts the model emits while filling the duration predictor's window. stripAudio's silence threshold doesn't catch them, so the user hears a faint click/breath. Not a crash, not strictly wrong — flagging as observed behavior in case it's worth a guard later (e.g. early-return when all content phonemes are punctuation).
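
If such a guard were ever added, a minimal version might look like the TypeScript sketch below. The punctuation set and the helper name `isPunctuationOnly` are assumptions; the real check would have to run on whatever the phonemizer treats as punctuation.

```typescript
// Assumed punctuation set; the model's actual vocabulary may differ.
const PUNCTUATION_CHARS = '.,!?;:…';

// True when the input contains at least one punctuation character and
// nothing else besides whitespace.
function isPunctuationOnly(input: string): boolean {
  let sawPunctuation = false;
  for (let i = 0; i < input.length; i++) {
    const ch = input.charAt(i);
    if (ch.trim() === '') continue; // skip whitespace
    if (PUNCTUATION_CHARS.indexOf(ch) === -1) return false; // real content
    sawPunctuation = true;
  }
  return sawPunctuation;
}
```

A caller could return an empty audio buffer early when this predicate holds, instead of running the model.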


Reproducing

Stress-test version of the speech app lives on branch @ms/tts-stress-tests — the only changed file is apps/speech/screens/TextToSpeechScreen.tsx (preset chip rows + force-stop button + a switch from model.stream() to model.forward() so the hook's silent period-append doesn't mask the actual model behavior).

git fetch origin
git checkout @ms/tts-stress-tests
yarn install
cd apps/speech && yarn android
adb logcat -c && adb logcat ReactNativeJS:V AndroidRuntime:E DEBUG:V libc:E '*:F'

Open the app → Text To Speech screen.

Top row — "Test presets" drives forward() (one-shot synthesis, no streaming wrapper):

  • space, spaces, newline, dot, excl, q, ... — punctuation/empty edge cases (relates to finding 4)
  • Hindi, Arabic, Chinese, Japanese, Hebrew, Russian, Korean — script-mismatch with English voice (Bartek's family ;p)
  • emoji, emoji-mix, ZW-chars, NUL, EN+Hindi, diacritics — non-vocab / mixed-script
  • speed=0, speed=NaN, speed=Inf, speed=-1, speed=1e-6, speed=1e9 — drives finding 1
  • noPh:EN, noPh:nums, noPh:syms - phonemize: false with non-phoneme input

Bottom row — "Streaming tests" drives streamInsert + stream() directly (bypassing the hook's . append):

  • no-term:a, no-term:long — finding 2; these will hang the worker (tap the red Force stop button to recover)
  • many-EOS — sanity check (multiple sentences in one insert)
  • insert-flood-EOS — concurrency / buffer growth under load (no race observed)
  • race:stop-during-synth — finding 3
  • race:insert-during-synth — sanity check (no data loss observed)

Tap Force stop any time a streaming test hangs — it calls streamStop(true) so you can keep testing.

Interpreting the logs. Each tap emits one of:

I ReactNativeJS: [TTS-test]   text=<json> speed=<n> phonemize=<bool>
I ReactNativeJS: [TTS-test]   forward() returned <N> samples
I ReactNativeJS: [TTS-test]   threw: <message>

I ReactNativeJS: [TTS-stream] start: <label>
I ReactNativeJS: [TTS-stream] <label> chunk #<n>: <N> samples (t=<ms>)
I ReactNativeJS: [TTS-stream] end: <label> — <chunks> chunks, <samples> samples, <ms>ms
I ReactNativeJS: [TTS-stream] threw: <label> -> <message>

Quick decoder:

  • returned 0 samples → safe no-op (input had no usable phonemes).
  • returned <large N> with weird input → check whether the model produced unintended audio (findings 1 / 4).
  • threw: std::exception with no detail → wrap with proper RnExecutorchError somewhere upstream.
  • start: <label> followed by no chunks and a multi-minute duration → streaming worker is hung (finding 2); you need Force stop.
  • start: <label> followed by no chunks but quick end (≲1 s) → in-flight chunk dropped (finding 3) — synthesizer ran, callback no-op'd.
  • chunk #N lines mean audio was delivered to JS. The streaming sanity tests (many-EOS, insert-flood-EOS, race:insert-during-synth) should each produce several chunks.
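
For longer sessions, the per-line part of this decoder can be automated. The TypeScript sketch below classifies single log lines in the formats listed above; the function name and verdict labels are made up for illustration, and detecting a hung worker (finding 2) still requires looking at timing across lines, which this sketch does not do.

```typescript
type Verdict = 'no-op' | 'audio' | 'threw' | 'chunk' | 'other';

// Classifies one logcat line emitted by the stress-test screen.
function classifyLogLine(line: string): Verdict {
  const returned = /forward\(\) returned (\d+) samples/.exec(line);
  if (returned) {
    // 0 samples is the safe no-op case; anything else is produced audio.
    return Number(returned[1]) === 0 ? 'no-op' : 'audio';
  }
  if (/\[TTS-(test|stream)\]\s+threw:/.test(line)) {
    return 'threw';
  }
  if (/\[TTS-stream\] .* chunk #\d+:/.test(line)) {
    // A chunk line means audio was delivered to JS.
    return 'chunk';
  }
  return 'other';
}
```

For example, feeding it the output of the `adb logcat` command above line by line gives a quick tally of no-ops, exceptions, and delivered chunks per test.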

Member

@msluszniak msluszniak left a comment


Comment above

@IgorSwat
Contributor Author

IgorSwat commented May 20, 2026

@msluszniak

  1. I added checks for the speed parameter.
  2. I've known about this issue for a long time. The thing is, for normal (one-shot) inputs it never happens, because we always add a dot '.' if no EOS character is present at the end. I once changed it in exactly the same way as your "reviewer" suggested, but then the LLM streaming mode often failed: it's very hard to tune the kStreamMaxSkippedIterations parameter so that it doesn't stop the streaming prematurely, since the right value depends on LLM generation speed. So fixing it properly is a non-trivial task, and I don't think we have time for that.
  3. Harmless issue. I don't see a point in complicating the API further for something like that.
  4. Another harmless issue. I don't see a point in covering all the "stupid" inputs from the user, unless one results in a direct crash.
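
For readers unfamiliar with the dot-append behaviour mentioned in item 2, the idea can be sketched in TypeScript as below. This is not the actual code from useTextToSpeech.ts; the helper name and the EOS character set are assumptions.

```typescript
// Assumed end-of-sentence characters; the real set may differ.
const EOS_CHARACTERS = '.!?';

// Appends '.' when the text does not already end with an EOS character,
// so the streaming worker always sees a sentence boundary.
function ensureTrailingEos(text: string): string {
  const trimmed = text.replace(/\s+$/, ''); // drop trailing whitespace
  if (trimmed.length === 0) {
    return trimmed; // nothing to terminate
  }
  const last = trimmed.charAt(trimmed.length - 1);
  return EOS_CHARACTERS.indexOf(last) === -1 ? trimmed + '.' : trimmed;
}
```

Applied to 'hello world', this yields 'hello world.', which is why one-shot inputs that pass through such a step never reach the no-EOS path in the worker.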

@IgorSwat IgorSwat force-pushed the @is/multilingual-tts branch from 81c1766 to 81daf0d Compare May 20, 2026 08:14
@msluszniak
Member

I agree that 4 is rather good to skip.
Regarding 3, ok I just wanted to be sure you are aware of this buggy behaviour.

Regarding 2:

> cause we always add a dot '.'

Where do we add this dot?

@IgorSwat
Contributor Author

IgorSwat commented May 20, 2026

> Where do we add this dot?

useTextToSpeech.ts, lines 107 to 112.

@msluszniak
Member

> > Where do we add this dot?
>
> useTextToSpeech.ts, lines 107 to 112.

OK, what about textToSpeechModule? I cannot see a similar hack there.

@IgorSwat
Contributor Author

> Ok, what about textToSpeechModule? I cannot see the similar hack.

We can't do that in textToSpeechModule. And I honestly do not see a point in doing so, if it is already done in the hook.

@msluszniak
Member

msluszniak commented May 20, 2026

Hmmm, the very minimum we need to do is to escalate this into a separate issue, since it is a serious problem. After the 0.9 release I will work on a solution that both solves this issue and does not break the LLM integration.

And by the way, the hook is a completely separate mechanism. If we had this hack in the module that is reused by the hook, then OK. But the other way around, I disagree.

Member

@msluszniak msluszniak left a comment


As I said, either you or I need to work on the EOS issue, but apart from that, I have nothing else to add. Great job overall! :))

@IgorSwat IgorSwat merged commit 9f752b6 into main May 20, 2026
5 checks passed
@IgorSwat IgorSwat deleted the @is/multilingual-tts branch May 20, 2026 09:17


Development

Successfully merging this pull request may close these issues.

Text to Speech - add new languages support

4 participants